---
title: "ACS Tables for CCES and MRP"
author: "Shiro Kuriwaki"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ACS Tables for CCES and MRP}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


### Common Misperceptions: Limitations of the ACS for MRP

The ACS is the main dataset used for post-stratification in MRP because of its wide coverage of different geographies. However, several limitations of the ACS data make MRP in practice diverge from the classic MRP as it is usually taught:

1. The ACS is conducted by the Census Bureau, but it __is a survey__, not a population census. It is an extremely large survey, with over 3 million respondents in any given year, but it is still a sample, and therefore its counts are estimates. The ACS even publishes margins of error around its counts.
2. __Microdata is not available__ for some geographies like congressional districts. By microdata, we mean data where each row is an individual respondent and can be aggregated up. The Census website, tidycensus, and IPUMS (<https://usa.ipums.org/>) all provide ACS data, and IPUMS has an amazing interface to download microdata, but that microdata does not include congressional district.
   * The Census provides some crosswalk files but they are mainly ZCTA to CD crosswalks. 
   * PUMAs are the main unit of area for the ACS, but these do not link well to CDs (see [here](https://forum.ipums.org/t/is-there-a-way-of-linking-acs-pums-data-with-voter-districts/611) for discussion and potential fractional matching). For example, Cambridge, Mass. is a single PUMA, but it has about 8 ZIP codes and is split into parts of MA-05 (Rep. Katherine Clark) and MA-07 (Rep. Ayanna Pressley).
3. The population Census is ideal, but it only occurs every 10 years, and only a 5 percent sample of it (with restrictions of its own) is given to the public with some lag time. Therefore the population Census is often not a good alternative to the ACS / CPS.
4. The ACS does __not ask some key variables__, such as voter registration (which is asked in the CPS) or party identification (which is only asked in political surveys like the CCES or Pew). That limits the variables one could include in the MRP model.
5. Of the variables that are available, __only certain combinations__ are available at the district level:
   * Three-way interactions of some demographic variables are available, but nothing beyond three-way.
   * The ACS provides thousands of cell counts, but only a subset of them are useful for MRP because cells for MRP must form a partition of the population (i.e., mutually exclusive and exhaustive categories). I describe several important combinations below.
   * See the synthetic imputation page  (`vignette("synth", package = "synthjoint")`) in this package to see how this restriction could be relaxed.
6. Some of the partition counts are __not perfect partitions__. For example, some of the variable codes I define below _exclude_ people aged 18-24, include non-citizens, or double-count African Americans who also identify as Hispanic/Latino. These introduce some error into the post-stratification.
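
To make the first limitation concrete, here is a minimal sketch of how a published margin of error can be turned into a standard error, and how margins of error combine when cells are collapsed. The counts are made up for illustration. The conversion assumes the ACS convention of reporting margins of error at the 90 percent level, and the root-sum-of-squares rule is an approximation that ignores covariance between cells.

```r
# Toy ACS-style estimates for three cells in one district (numbers are made up)
cells <- data.frame(
  cell     = c("18-24", "25-34", "35-44"),
  estimate = c(12000, 15000, 14000),
  moe      = c(900, 1100, 1000),   # margins of error at the 90 percent level
  stringsAsFactors = FALSE
)

# A 90 percent margin of error is 1.645 standard errors,
# so dividing by 1.645 gives an approximate standard error
cells$se <- cells$moe / 1.645

# Collapsing cells: estimates add, and margins of error combine
# approximately as the root sum of squares (covariances are ignored)
collapsed <- data.frame(
  estimate = sum(cells$estimate),
  moe      = sqrt(sum(cells$moe^2))
)
collapsed$se <- collapsed$moe / 1.645
collapsed
```

The collapsed margin of error is smaller than the simple sum of the three margins, which is one reason coarser cells are estimated more precisely in relative terms.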

This package and vignette go into more detail about these limitations and the nature of the data that is available.
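
One way to see whether a set of cells forms a partition is to check that no cell is repeated and that the counts add up to a known total. The sketch below uses made-up counts and a hypothetical `adult_total`; it is not a function from this package.

```r
# Hypothetical cell counts for one district (made-up numbers)
cells <- data.frame(
  age   = c("18-24", "18-24", "25+", "25+"),
  sex   = c("F", "M", "F", "M"),
  count = c(30000, 31000, 240000, 229000),
  stringsAsFactors = FALSE
)
adult_total <- 545000  # hypothetical published count of all adults in the district

# Mutually exclusive: no age x sex combination appears more than once
stopifnot(!anyDuplicated(cells[, c("age", "sex")]))

# Exhaustive: the cells should add up to the published total.
# A ratio below 1 suggests an excluded group; above 1 suggests double-counting.
coverage <- sum(cells$count) / adult_total
coverage
```

A ratio that is noticeably different from 1 is a signal to reread the table universe (for example, whether it covers citizens only or adults 25 and over) before using the counts for post-stratification.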


### Three types of ACS Tables

If you are in a hurry or want something to start with, here are three partitions this package provides:

1. __Age x Sex x Education__. These variable codes are encoded in `?acscodes_age_sex_educ`.
2. __Age x Sex x Race__. These variable codes are encoded in `?acscodes_age_sex_race`.
3. __Sex x Education x Race__. These variable codes are encoded in `?acscodes_sex_educ_race`.

See their help pages for more information and their limitations.
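
As a rough illustration of how counts from such a partition are used, the sketch below post-stratifies made-up cell predictions with made-up cell counts. It uses a two-way sex by education partition to keep the table short; the column names `count` and `pred` are assumptions for this example, not the format of the package's data.

```r
# Hypothetical post-stratification table for one district:
# one row per cell of the partition, with a population count and a
# model-based estimate of the outcome in that cell (all numbers made up)
ps <- data.frame(
  sex   = c("F", "F", "M", "M"),
  educ  = c("No college", "College", "No college", "College"),
  count = c(200, 100, 150, 50),
  pred  = c(0.40, 0.60, 0.30, 0.50),
  stringsAsFactors = FALSE
)

# District estimate: cell predictions averaged with weights
# proportional to the population count of each cell
district_est <- weighted.mean(ps$pred, w = ps$count)
district_est
```

Because the weights are the cell counts, any error in those counts (sampling error or an imperfect partition) carries through to the district estimate.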


### Collapsing variable codes

The survey and the post-stratification table must share discrete variables whose levels match 1:1. Therefore, if race is binned 9 ways in the ACS but there are only 5 response options in the CCES, the finer ACS groups must be collapsed into the 5 coarser levels of the CCES. This also means that the recoding must be nested: it is hard to salvage two variable codings unless one is a many-to-one mapping into the other. These judgments must be made with careful attention to the question wording and are documented to some extent in `?namevalue`.
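
A minimal sketch of such a many-to-one collapse, using a named lookup vector in base R, might look as follows. The category labels and counts are made up for illustration and are not the actual codings used in this package.

```r
# Hypothetical fine-grained race categories with made-up counts
acs <- data.frame(
  race_fine = c("White alone", "Black alone", "Asian alone",
                "Pacific Islander alone", "Two or more races"),
  count     = c(700, 120, 60, 5, 15),
  stringsAsFactors = FALSE
)

# Many-to-one lookup: every fine level maps to exactly one coarse level
# (the coarse labels are illustrative, not the package's actual coding)
lookup <- c(
  "White alone"            = "White",
  "Black alone"            = "Black",
  "Asian alone"            = "Asian",
  "Pacific Islander alone" = "Other",
  "Two or more races"      = "Other"
)

# Every fine level must be covered, otherwise the recoding is not nested
stopifnot(all(acs$race_fine %in% names(lookup)))

acs$race_coarse <- unname(lookup[acs$race_fine])

# Sum the counts within each coarse level
collapsed <- aggregate(count ~ race_coarse, data = acs, FUN = sum)
collapsed
```

The total count is unchanged by the collapse, which is a quick sanity check that no fine category was dropped or assigned twice.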

### Thanks

The statements in this vignette benefited from many online resources and correspondence with experts including Yair Ghitza and Matto Mildenberger. I also rely on findings from several published papers:

* Howe, P. D., Mildenberger, M., Marlon, J. R., & Leiserowitz, A. (2015). Geographic variation in opinions on climate change at state and local scales in the USA. _Nature Climate Change_, 5(6), 596–603. https://doi.org/10.1038/nclimate2583
* Kastellec, J. P., Lax, J. R., & Phillips, J. H. (2010). Public Opinion and Senate Confirmation of Supreme Court Nominees. _Journal of Politics_, 72(3), 767–784. https://doi.org/10.1017/S0022381610000150
* Walker, K. (2020). tidycensus: Load US Census Boundary and Attribute Data as 'tidyverse' and 'sf'-Ready Data Frames. R package version 0.9.9.2. https://CRAN.R-project.org/package=tidycensus
* Warshaw, C., & Rodden, J. (2012). How should we measure district-level public opinion on individual issues? _Journal of Politics_, 74(1), 203–219. https://doi.org/10.1017/S0022381611001204
